feat(bootstrap,cli): switch GPU injection to CDI where supported #495
Conversation
```rust
/// | Input        | Output                                                       |
/// |--------------|--------------------------------------------------------------|
/// | `[]`         | `[]` — no GPU                                                |
/// | `["legacy"]` | `["legacy"]` — pass through                                  |
/// | `["auto"]`   | `["nvidia.com/gpu=all"]` if CDI supported, else `["legacy"]` |
/// | `[cdi-ids…]` | unchanged                                                    |
pub(crate) fn resolve_gpu_device_ids(gpu: &[String], docker_version: Option<&str>) -> Vec<String> {
```
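The table in the doc comment above maps onto roughly the following logic. This is an illustrative sketch, not the PR's implementation; it assumes (per the CLI commit message) that CDI support is keyed off Docker >= 28.2.0, and `cdi_supported` is an invented helper:

```rust
/// Illustrative reimplementation mirroring the `resolve_gpu_device_ids`
/// signature from the diff; not the PR's actual code.
fn resolve_gpu_device_ids(gpu: &[String], docker_version: Option<&str>) -> Vec<String> {
    // "auto" is a sentinel: pick CDI when the daemon is new enough.
    if gpu.len() == 1 && gpu[0] == "auto" {
        if docker_version.map_or(false, cdi_supported) {
            return vec!["nvidia.com/gpu=all".to_string()];
        }
        return vec!["legacy".to_string()];
    }
    // `[]`, `["legacy"]`, and explicit CDI device IDs pass through unchanged.
    gpu.to_vec()
}

/// Hypothetical version gate: compare the leading "major.minor" of the
/// reported daemon version against 28.2.
fn cdi_supported(version: &str) -> bool {
    let mut parts = version.split('.').map(|p| {
        p.chars()
            .take_while(|c| c.is_ascii_digit())
            .collect::<String>()
            .parse::<u64>()
            .unwrap_or(0)
    });
    let (major, minor) = (parts.next().unwrap_or(0), parts.next().unwrap_or(0));
    (major, minor) >= (28, 2)
}
```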
It feels weird to me to overload this flag with legacy and auto if it is meant to be the list of device IDs in the end.
Would it make more sense to add a new flag for the mode that accepts auto, legacy, or cdi, and then have --gpus (or your new --devices alias) accept the CDI devices if and only if it is in cdi mode (or auto mode choosing cdi)?
Is the concern backwards compatibility with the existing semantics of the --gpu boolean flag?
Yes, the reason I wanted to extend the existing flag is to maintain backward compatibility. I was initially going to add a separate --device flag to mirror what we have done in other runtimes, but this would require more user engagement.
It should also be noted that --gpu is equivalent to --gpu="auto", so the UX does not change. I also only added legacy as an option to let users explicitly opt out in cases where CDI injection is not doing what it should. In the medium term, I would expect legacy to be removed entirely (or to simply map to nvidia.com/gpu=all).
architecture/gateway-single-node.md (Outdated)
| Flag | Behaviour |
|---|---|
| `--gpu` | Auto-select: CDI when enabled on the daemon, `--gpus all` otherwise |
| `--gpu=legacy` | Force `--gpus all` |
| `--gpu=<cdi-device>` | Inject a specific CDI device (e.g. `nvidia.com/gpu=all`). May be repeated for multiple devices. Note: because the cluster container runs privileged, device-level isolation may not work as expected. |
This seems problematic, if we actually want to limit this to something other than all.
> because the cluster container runs privileged, device-level isolation may not work as expected
CDI is about more than devices. One may have a CDI spec that includes other mounts intended for the sandbox use case but not for general workloads. In the case of the GPU Operator and its components, for example, we use the management.nvidia.com/gpu=all device for bootstrapping the system.
With that in mind, do you have a problem with the option at all (e.g. allowing additional injection in the case of testing or onboarding other vendors), or just the wording of the comment?
If the container is running as privileged and someone passes nvidia.com/gpu=0, they are going to see all GPUs, not just GPU 0.
Yes, that's what the comment is trying to convey. Note that this applies to the gateway container, which acts as the SINGLE k3s node. The sandbox containers are scheduled by k3s with nvidia.com/gpu extended-resource requests, and each of these would have only a single GPU. (They also do not run as privileged.)
As a follow-up, we could consider adding logic similar to what we did in nvkind to remove unrequested devices, but I would see that as an improvement / fix for this behaviour.
I guess what I'm trying to say is that maybe we should somehow limit it to all or nothing. It's a weird user experience to say we support passing in an arbitrary CDI device, but then not honoring what it says.
From the perspective of the tooling here, the device name is an arbitrary string; apart from our convention of using all in the name, there is nothing that indicates whether a CDI device targets one or more devices.
I can update this to check for a device name of all?
After an offline discussion, I have removed the ability to specify a CDI device name (and the --device alias).
Use an explicit CDI device request (`driver="cdi"`, `device_ids=["nvidia.com/gpu=all"]`) when the Docker daemon reports CDI spec directories via `GET /info` (`SystemInfo.CDISpecDirs`). This makes device injection declarative and decouples spec generation from consumption.

When the daemon reports no CDI spec directories, fall back to the legacy NVIDIA device request (`driver="nvidia"`, `count=-1`), which relies on the NVIDIA Container Runtime hook. Failure modes for both paths are equivalent: a missing or stale NVIDIA Container Toolkit installation will cause container start to fail.

CDI spec generation is out of scope for this change; specs are expected to be pre-generated out-of-band, for example by the NVIDIA Container Toolkit.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
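The two request shapes described in the commit message can be illustrated with a struct mirroring the `driver`, `count`, and `device_ids` fields of the Docker Engine API's `DeviceRequest`; both the struct and the `gpu_device_request` helper here are hypothetical, not code from the PR:

```rust
/// Hypothetical mirror of the Docker Engine API DeviceRequest fields used here.
#[derive(Debug, PartialEq)]
struct DeviceRequest {
    driver: String,
    count: i64,
    device_ids: Vec<String>,
}

/// Map a resolved device-ID list to the request shape the commit describes.
fn gpu_device_request(device_ids: &[String]) -> Option<DeviceRequest> {
    match device_ids {
        // No GPU requested: attach no device request at all.
        [] => None,
        // Legacy path: driver="nvidia", count=-1 (all GPUs via the runtime hook).
        [id] if id.as_str() == "legacy" => Some(DeviceRequest {
            driver: "nvidia".to_string(),
            count: -1,
            device_ids: Vec::new(),
        }),
        // CDI path: driver="cdi" with explicit device IDs; count is unused.
        ids => Some(DeviceRequest {
            driver: "cdi".to_string(),
            count: 0,
            device_ids: ids.to_vec(),
        }),
    }
}
```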
The `--gpu` flag on `gateway start` now accepts an optional value:

    --gpu          Auto-select: CDI on Docker >= 28.2.0, legacy otherwise
    --gpu=legacy   Force the legacy nvidia DeviceRequest (driver="nvidia")

Internally, the `gpu` bool parameter to `ensure_container` is replaced with a `device_ids` slice. `resolve_gpu_device_ids` resolves the "auto" sentinel to a concrete device ID list based on the Docker daemon version, keeping the resolution logic in one place at deploy time.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
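The flag semantics above can be sketched without an argument-parsing library; `parse_gpu_flag` is an invented helper for illustration (the real CLI presumably uses its existing parser), showing how the bare flag, the `=value` form, and the flag's absence map to the `gpu` argument list:

```rust
/// Hypothetical sketch: map command-line args to the `gpu` device-ID list
/// that `resolve_gpu_device_ids` would consume.
fn parse_gpu_flag(args: &[&str]) -> Vec<String> {
    for arg in args {
        // Bare `--gpu` is equivalent to `--gpu=auto`, preserving the old UX.
        if *arg == "--gpu" {
            return vec!["auto".to_string()];
        }
        // `--gpu=<value>`, e.g. `--gpu=legacy` to force the legacy path.
        if let Some(mode) = arg.strip_prefix("--gpu=") {
            return vec![mode.to_string()];
        }
    }
    // Flag absent: no GPU injection requested.
    Vec::new()
}
```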
Summary
Switch GPU device injection in cluster bootstrap to use CDI (Container Device Interface) when it is enabled in Docker (the `docker info` endpoint returns a non-empty list of CDI spec directories). When this is not the case, the existing `--gpus all` NVIDIA `DeviceRequest` path is used as a fallback. The `--gpu` flag on `gateway start` is extended to let users force the legacy injection mode.
Related Issue
Part of #398
Changes
- feat(bootstrap): Auto-select CDI (`driver="cdi"`, `device_ids=["nvidia.com/gpu=all"]`) if CDI is enabled on the daemon; fall back to legacy `driver="nvidia"` on older daemons or when CDI spec dirs are absent
- feat(cli): `--gpu` now accepts an optional value: omit for auto-select, `--gpu=legacy` to force the legacy `--gpus all` path
- test(e2e): Extend the gateway start help smoke test to cover the `--gpu` and `--recreate` flags

Testing

- `mise run pre-commit` passes
- `resolve_gpu_device_ids` coverage

Checklist
mise run pre-commitpassesresolve_gpu_device_idscoverage)Checklist